POSCAT: A Morpheme-based Speech Corpus Annotation Tool
Abstract
As speech systems increasingly require linguistic knowledge to support applications at various levels, corpora tagged with linguistic annotations as well as signal-level annotations are highly desirable for developing today’s speech systems. Among the linguistic annotations, POS (part-of-speech) tags are indispensable in speech corpora for most modern spoken language applications of morphologically complex agglutinative languages such as Korean. To meet these demands, we have developed a single unified speech corpus annotation tool that enables corpus builders to link linguistic annotations to signal-level annotations, using a morphological analyzer and a POS tagger as the basic morpheme-based linguistic engines. The tool also integrates a syntactic analyzer, a phrase break detector, a grapheme-to-phoneme converter and an automatic phonetic aligner. Each engine automatically produces its own linguistic or signal-level annotations and interacts with the corpus developers so that they can revise and correct the annotations on demand. All the linguistic and phonetic engines were developed and merged with an interactive visualization tool in a client-server network communication model. Corpora constructed with the tool are multi-purpose and applicable to both speech recognition and text-to-speech (TTS) systems. Finally, because the linguistic and signal processing engines and the interactive visualization tool are implemented within a client-server model, the system load can be distributed over several machines.
Similar resources
Integrating Linguistic and Signal Knowledge in a Morpheme Based Speech Corpus Annotation Tool
As more and more speech systems require high-level linguistic knowledge to accommodate various levels of applications, corpora that are tagged with high-level linguistic annotations as well as signal-level annotations are highly recommended for development of today's speech systems. Among the high-level linguistic annotations, POS (part-of-speech) tag annotations are indispensable in speech cor...
Statistical Corpus Analysis for KT-TREASURE: Korea Telecom Train Ticket Reservation Aid System Based upon Speech Recognition
This paper describes statistical analysis results of the corpus for KT-TREASURE (Korea Telecom Train ticket REservation Aid System based Upon speech REcognition). At the start of this development, two speech corpora were collected. One was based on human-human (H-H) dialogues and the other on human-computer (H-C) dialogues. A Wizard of Oz (WOZ) experiment was carried out to coll...
A Korean speech corpus for train ticket reservation aid system based on speech recognition
This paper describes the Korean speech corpus for a train ticket reservation aid system based on speech recognition. Two speech corpora were collected. One was based on human-human (H-H) dialogues and the other on human-computer (H-C) dialogues. A WOZ (Wizard of Oz) experiment was carried out to collect the speech corpus based on H-C spoken dialogue. A total of 298 speaker data was collec...
Building a Tree-Bank of Modern Hebrew Text
This paper describes the process of building the first tree-bank for Modern Hebrew texts. A major concern in this process is the need for reducing the cost of manual annotation by the use of automatic means. To this end, the joint utility of an automatic morphological analyzer, a probabilistic parser and a small manually annotated tree-bank was explored. An initial tree-bank that consists of 50...
Daba: a model and tools for Manding corpora
This article provides a brief overview of the Daba software package, created in the course of building corpora for Manding languages. Key software features are motivated by the tasks and problems characteristic of many African languages. The corpus-building model proposed here was initially developed for the Bambara Reference Corpus, which is available online and is freely accessible. The morphological a...
Publication date: 2000